\section{Architecture}\label{sec:architecture}
The core mechanism used to separate subjects is Intel's hardware-assisted
virtualization technology VT-x\index{VT-x}. The kernel executes in VMX root
mode, while subjects run in VMX non-root mode. This shields the kernel from
access by subjects. Figure \ref{fig:architecture-overview} illustrates the
basic system architecture: the Muen kernel executes two isolated subjects that
have no access to any kernel resources. The native subject is a bare bones
application and the second subject is a virtual machine (VM\index{VM}) type
subject, e.g. an operating system.

\begin{figure}[h]
	\centering
	\input{graph_arch_overview}
	\caption{Architecture overview}
	\label{fig:architecture-overview}
\end{figure}

Resource assignment to subjects is static and done prior to the execution of
the system. All available resources provided by a platform running the Muen
kernel must be specified in a system policy. Similarly, scheduling plans and
all subjects running on the system are part of the policy.

This relieves the kernel from dynamic resource management and avoids the
associated complexity, which greatly simplifies the kernel design and in turn
its implementation.

The policy is described in detail in section \ref{subsec:policy}.

\subsection{Subject Execution}
The kernel initializes the subject execution environment and state according to
the subject specification. The execution is started from the subject entry
point and continues for a predefined period of time. Once the time has passed,
a trap into the kernel occurs and the current state of the subject is saved for
later resumption.

A subject is constrained to the environment specified by the subject profile
and the resources assigned to it by the policy. If a subject performs an
illegal resource access or an operation not allowed by the profile, a trap
occurs and the kernel is invoked. The kernel can then determine the cause for
the transition and handle the condition according to policy.

\subsubsection{Subject Monitor}
When a subject tries to access resources such as devices that are emulated, a
system component must perform the necessary actions and change the subject's
system state accordingly. This is to give the subject the impression that it
has unrestricted access to a device while in reality the necessary operations
are effectively emulated by another component.

In a system based on the Muen kernel, this monitoring function can be
implemented by a subject termed subject monitor. Such a monitoring subject can
be given access to the state of another subject, effectively allowing it to
change the system state of the monitored subject. A so called \emph{trap
table}, which is part of the subject specification, allows to specify that
execution should be handed over to a subject monitor when a trap occurs.

Historically this monitoring function has been implemented as part of the VMM
of the virtualization system. Since emulation operations can be very involved
and quickly grow in complexity, it is very desirable to extract this
functionality from the separation kernel and thus from the TCB. Since subjects
are confined by the separation kernel the subject monitor concept achieves this
property. This approach has been pioneered by NOVA, see
\cite{Steinberg:2010:NMS:1755913.1755935}.

\subsection{Policy}\label{subsec:policy}
To best fulfill the requirements, a system using the Muen kernel uses static
resource assignment and system specification. The main idea is to have a
complete description of the system including all resources such as system
memory, devices and subjects in the form of a policy\index{policy}. The defined
resources can be assigned to subjects. Since the resources and subjects are
fixed at integration time, the policy can be analyzed and validated prior to
execution.

The following list encompasses the system policy:
\begin{itemize}
	\item Hardware resources
	\item Kernel specification
	\item Subject specifications
	\item Scheduling information
\end{itemize}

\subsubsection{Memory}
All memory resources of a system are static and explicitly specified in the
system policy. Apart from a few special memory regions, such as the application
processor trampoline\index{trampoline} (see section \ref{subsec:init}), there
are no implicitly allocated data structures. This allows one to determine the
exact memory layout of the final system at integration time.

The kernel and each subject specifies its memory layout. The memory layout
defines which physical memory ranges are accessible, their location in the
virtual address space of the binary and additional attributes. These attributes
define read/write and executable properties of the memory region and the caching
behavior.

Since the layout is controlled by memory management data structures,
each subject has its own distinct set of page tables. To assure that a subject
cannot alter the memory layout, it must not have access to any page tables,
including its own. This is achieved by not mapping memory management structures
into the address space of any subject.

Figure \ref{fig:phys-mem-layout-example} illustrates the physical memory
structure of an example system. The hexadecimal values on the left side are
physical addresses in memory.

\begin{figure}[h]
	\centering
	\input{graph_phys_mem_layout_example}
	\caption{Physical memory layout example}
	\label{fig:phys-mem-layout-example}
\end{figure}

The memory region containing kernel code and data starts at address
\texttt{0x100000} (1 MB) and has a size of 112 KB. It is followed by a gap of
unallocated memory and then the "subject descriptors"\footnote{Internal kernel
array used to store subject states} and "$\tau0\rightarrow$ kernel interface"
regions which are each 4 KB in size.  Memory management data structures of the
kernel start at address \texttt{0x200000} (2 MB) and expand over 16 KB. The
memory layout of subject $\tau0$ encompasses three regions ranging from
\texttt{0x210000} to \texttt{0x0219fff}.

Since the address space of a subject cannot change, the page tables are static
as well, which means they can be generated in advance according to the relevant
information in the system policy. Whenever a subject is executed on the CPU, the
kernel directs the MMU to use the corresponding paging structures. The hardware
memory management mechanism then enforces the address translations specified by
the page tables, ultimately restricting the subject to the virtual address space
declared by the policy. Figure \ref{fig:virt-mem-layout-example} illustrates the
memory layout of an example subject. The values on the left are again physical
addresses, the ones on the right are virtual addresses.

\begin{figure}[h]
	\centering
	\input{graph_virt_mem_layout_example}
	\caption{Virtual memory layout of example subject}
	\label{fig:virt-mem-layout-example}
\end{figure}

\subsubsection{Devices}
A device can be modeled as a collection of system resources such as I/O ports,
system memory for memory-mapped IO (MMIO\index{MMIO}) or hardware interrupts. A
PS/2 keyboard for example is composed of two I/O ports and a hardware
interrupt.  In order for subjects to interact with a device, it must be able to
access the I/O ports and process the interrupts raised by the device. Thus a
policy device specification links a device name with a set of hardware
resources. Assignment to subjects is done via references to devices in the
subject specification.  Devices that are not allocated to a subject are not
accessible during the runtime of the system.

\subsubsection{Processor}
Information about the processor of the system must also be specified in the
policy, since the number of available logical CPUs\index{logical CPU} for
example is crucial for scheduling. This information is used to verify the
consistency of the scheduling plan with regards to the processor that the
system is executing on.

\subsubsection{Subject Specification}
All subjects of a system must be specified as part of the policy. The subject
specification\index{subject specification} defines the resources and properties
of a given subject. Each subject is identified by a unique ID and name. The
name is used for subject references in the policy, e.g. trap table entries
specify a destination subject which is given by name. Most subject data
structures are organized in static arrays and the ID can be used as an index to
get the information associated with a specific subject.

The policy also defines what execution environment the kernel must provide by
assigning a profile to each subject. Initial values for parts of the execution
environment (stack and instruction pointer) must be provided by the subject
specification and also the binary file constituting the executable code of the
subject and its memory location must be defined.

Another important part is the assignment of resources. This includes the
designation of the executing CPU, references to granted hardware devices and
the memory layout of the virtual address space. Access to I/O ports and
model-specific registers (MSRs\index{MSR}) is part of the subject specification
as well.

There are two tables that specify how certain conditions caused by a subject
are to be handled by the kernel: the trap and event tables.

The \emph{trap table}\index{trap table} designates which destination subject is
in charge of handling a specific trap and what interrupt vector should be
injected, if any.  It defines to which target subject execution control is
transferred when a trap occurs. The optional injection of an interrupt
indicates the trap kind to the destination subject, so it can distinguish
different causes for traps.

Events, which are described in section \ref{subsubsec:events}, are specified in
the \emph{event table}\index{event table}. An event is a trap that is
explicitly triggered by a subject. The event table specifies how the kernel is
supposed to handle such events.

Both tables are part of the subject policy because the information is
subject-specific. This allows for the configuration of different reactions to
traps depending on the subject causing the condition.

\subsection{Inter-Subject Communication}
The Muen kernel provides two main mechanisms for information flow between
subjects: shared memory and events. The first mechanism can be used to share
arbitrary data between subjects while events are a method to pass notifications
between subjects in the form of injected interrupts.

All communication paths, be it shared memory regions or events, are declared as
part of the subject specification in the system policy. This means that all
channels that provide a way for subjects to exchange information are explicit
and cannot change during the runtime of the system. This allows the validation
of inter-subject information flows prior to execution.

\subsubsection{Shared Memory}
The main mechanism for subjects to exchange data is to specify a common shared
memory region in the system policy. A memory region is shared if two or more
subject memory layouts specify a region that maps the same physical memory range
into the respective address spaces.

Access rights for memory regions (e.g. write access) are part of the subject
memory layout. This allows one to specify a direction of information flow for
memory regions, e.g. by granting write access to the source subject exclusively
and giving the destination subject read-only access.

Since memory management is static and exclusively governed by the system policy,
the kernel does not need to make special provisions. The kernel is oblivious to
shared memory regions, it simply installs the page tables generated by the
policy compiler.

The kernel does not provide a higher level abstraction than shared memory.
Subjects must implement their own protocol, such as message passing, based on
shared memory regions and the event mechanism.

\subsubsection{Events}\label{subsubsec:events}
An event\index{event} is caused by a subject and can be used to either deliver
an interrupt or hand over execution to another subject. There are two kind of
events: handover and interrupt events. The subject triggering an event is
called the source subject while the receiver is called the destination subject.
Events facilitate the implementation of a simple inter-subject notification
mechanism and enable event-driven services.

\emph{Handover events}\index{handover event} allow voluntary transfer of
execution control with optional interrupt delivery to a destination subject.
The intended use for this event kind is to allow subject monitors to resume
their monitored subject. A handover is achieved by replacing the source with
the destination subject in the system's scheduling plan. The interrupt that is
optionally injected can be used as an indicator for the cause of the handover,
which allows the destination subject to quickly react without having to inspect
the triggering subject's state.

\emph{Interrupt events}\index{interrupt event} simply inject an interrupt to a
destination subject without execution handover. This allows a subject to send a
notification to a destination subject in the form of an interrupt.

Events are explicitly triggered by a given source subject and are declared as
part of the subject specification. The subject policy contains a list of event
table entries which state the event number, destination subject and interrupt
number to inject. Event numbers must be unique since they are used to identify a
specific event.

To trigger an event, a subject provokes a trap into the kernel, passing the
event number as a parameter. The kernel then consults the subject's event table
to determine the destination subject. If the event number is invalid (e.g. not
in the range of the subject's event table) the event is ignored and the
execution of the subject is resumed.

\subsection{Exceptions}\label{subsec:design-exceptions}
We distinguish hardware exceptions\index{exceptions} that occur in VMX non-root
mode, while executing a subject, and in VMX root mode when the kernel is
operating.

Use of the SPARK programming language with the ability to prove the absence of
runtime errors means that exceptions are not expected during regular operation
in VMX root-mode. If for some unforeseen reason (e.g. non-maskable interrupt
NMI) an exception occurs, it indicates a serious error and the system is halted.

In the case of an exception being caused by the execution of a subject, the kind
of exception handling depends on the profile of the running subject. If a native
subject raises an exception, a transition to VMX root-mode occurs with the exit
reason indicating the cause. The kernel then consults the subject's trap table
to determine which subject is in charge of handling the condition.

VM subjects are able to perform their own exception handling. Thus if such a
subject raises an exception, it is delivered to the subject's exception handler
via the IDT (see section \ref{subsec:exceptions-interrupts}).

A trap occurs if the subject is somehow not able to handle the exception
properly and a triple fault occurs (e.g. a nested exception occurs in the
exception handler and the subject has not installed a double fault handler).
The trap is then processed by the kernel like any other trap caused by the
subject by using the trap table to schedule the destination subject according to
the policy.

\subsection{Interrupts}
Devices can generate hardware interrupts\index{interrupts} that must be
delivered to the subject that controls the device. A device specification in
the system policy defines which hardware interrupt it generates. Devices are
assigned to subjects by means of device references in the subject specification
part.

Since resource allocation is static, a global mapping of hardware interrupt to
destination subject can be compiled at integration time. The kernel then uses
this mapping at runtime to determine the destination subject that constitutes
the final recipient of the interrupt.

Each subject has a list of pending interrupts. An interrupt is forwarded to a
subject by appending an entry to the interrupt list of the destination subject.
When the execution of a subject is resumed, the kernel consults this list and
injects the interrupt completing its delivery.

Spurious or invalid interrupts that have no valid interrupt to subject mapping
are ignored by the kernel.

\subsection{Multicore}\label{subsec:multicore}
Modern computers have an increasing number of CPU cores per processor. To
utilize the hardware to its full potential, the separation kernel must make use
of all available cores. In a multicore\index{multicore} system, a physical CPU
package provides more than one processor core in a single package.
Additionally, systems equipped with Intel's Hyper-Threading Technology
(HTT\index{HTT}) have two or more \emph{logical CPUs} per core. A logical CPU
is an execution unit of the processor that runs an application or a kernel
process.

In MP systems, one processor termed \emph{bootstrap processor}\index{bootstrap
processor}(BSP) is responsible for system initialization, while the other
processors, called \emph{application processors}\index{application processor}
(APs), wait for an initialization sequence as specified by Intel in
\cite{intel:mp}.

At the basis of the multicore design is the symmetric execution of the kernel on
every available logical CPU. This means that all cores execute exactly the same
Muen kernel code, apart from the system bring up code run exclusively by the
BSP.

Global data is shared between all cores which necessitates some form of
inter-core synchronization\index{synchronization} to guarantee data
consistency. The synchronization primitives used by the separation kernel are
spinlock and barrier.

The spinlock\index{spinlock} facilitates the exclusive access of resources to a
CPU holding the spinlock. Before using a shared resource a logical CPU has to
acquire the lock and release it once it is finished. The subject's pending
event lists and multi-line debug output statements are protected by the
spinlock. Future improvements to the event and interrupt processing could
remove the need for the event list and in turn the locking, see section
\ref{subsec:apicv}.

A barrier\index{barrier} is used to guarantee, that all logical CPUs have
reached a specific execution point and are waiting for the release of the
barrier. This allows for synchronization of all CPUs to simultaneously start
executing instructions from a specific point in the kernel code. This mechanism
is used to manage major frame changes and avoid any inter-processor drift with
regards to scheduling plans.

The stack and a memory region called per-cpu storage is private to each CPU
core. That means CPU stack and per-cpu data is inaccessible by other CPU cores.

One globally shared variable is the major frame index which is used to
coordinate scheduling across all CPU cores. Its use and how a consistent view of
the variable's value across all cores is guaranteed is presented in the next
section.

An important aspect of Muen's multicore design is that subjects are pinned to a
specific logical CPU. Subjects do not migrate between cores and are exclusively
executed on the core defined by the associated subject specification in the
system policy. This removes complexity from the kernel and the overall system by
thwarting potential isolation issues which could be caused by the transfer of
subjects and their state between cores.

\subsection{Scheduling}\label{subsec:design-scheduling}
This section presents the design of the Muen kernel scheduler\index{scheduler}
and the selected scheduling algorithm.

In the context of this work, scheduling is defined as the process of selecting
a subject and giving it access to system resources for a certain amount of time.
The main resource is processor time, which enables a subject to execute and
perform its task.

A key objective of the scheduler is to provide temporal
isolation\index{temporal isolation} of all subjects. In order to meet this
requirement, the scheduler must prevent any interference between guests. To
achieve this, scheduling is done in a fixed, cyclic and preemptive manner.

Subjects are executed for a fixed amount of time, before being forcefully
preempted by the scheduler. Preemption means that regardless of what operations
a subject is performing, its execution is suspended when the allotted time
quantum has been consumed. After a subject has been suspended, the scheduler
executes the next subject for a given amount of time.

Scheduling information is declared in a \emph{scheduling plan}\index{scheduling
plan}. Such a plan is part of the policy and specifies in what order subjects
are executed on which logical CPU and for how long. The task of the scheduler is
to enforce a given scheduling regime.

A scheduling plan is specified in terms of frames. A \emph{major frame}
\index{major frame} consists of a sequence of minor frames. When the end of a
major frame is reached, the scheduler starts over from the beginning and uses
the first minor frame in a cyclic fashion. This means that major frames are
repetitive. A \emph{minor frame}\index{minor frame} specifies a subject and a
precise amount of time. This information is directly applied by the scheduler.

Figure \ref{fig:example-major-frame} illustrates the structure of a major frame.
The major frame consists of four minor frames. Minor frame 2 has twice the
amount of ticks than the other minor frames, which have identical length. Time
progresses on the horizontal axis from left to right.

When the major frame starts, subject 1 is scheduled for the length of minor
frame 1, followed by a switch to subject 2. After that the two subjects are
again scheduled in alternating fashion.

\begin{figure}[ht]
	\centering
	\input{graph_major_frame}
	\caption{Example major frame}
	\label{fig:example-major-frame}
\end{figure}

An example of a scheduling plan for multiple logical CPUs is depicted in figure
\ref{fig:example-scheduling-plan}. It illustrates a system with two logical CPUs
that execute various subjects indicated by different colors.

\begin{figure}[ht]
	\centering
	\input{graph_scheduling_plan}
	\caption{Example scheduling plan}
	\label{fig:example-scheduling-plan}
\end{figure}

CPU0 is executing the same subject for the whole duration of the major frame.
This could for example be the $\tau$0 subject executing on the bootstrap
processor (BSP). The second CPU is executing two subjects (blue and green) in
alternating order. As can be seen, subject green is granted more CPU cycles than
subject blue. All CPUs of the system wait on a barrier at the beginning of a new
major frame.  This guarantees that all logical CPUs of a system are in-sync when
major frames change.

On systems with multiple logical CPUs, a scheduling plan must specify a sequence
of minor frames for each processor core. In order for the cores to not run out
of sync, a major frame must be of equal length on all CPUs. This means that the
sum of all minor frame time slices of a major frame must amount to the same time
duration.

Since a system performs diverse tasks with different resource requirements,
there is a need for some flexibility with regards to scheduling. To provide this
degree of freedom while keeping the kernel complexity low, multiple scheduling
plans can be specified in the system policy. By defining a distinct plan for
each anticipated workload in the policy, the scheduling regimes are fixed at
integration time. This removes any runtime uncertainty that might be introduced
by this enhancement since the scheduling plans cannot be altered at runtime.

The privileged subject $\tau$0 is allowed to select and activate one of the
scheduling plans. A global variable termed \emph{major frame index} designates
the currently active scheduling plan. Its value is exclusively written by the
BSP while it is used by all cores to determine the currently active major frame.
The following protocol is employed to access the major frame index variable from
multiple CPUs:

\begin{enumerate}
	\item Each logical CPUs waits on the global barrier as soon as it has
		reached the end of the current major frame. The barrier remains closed
		until all system cores have reached that point.
	\item The barrier is released and all cores except the BSP immediately wait
		on the global barrier again.
	\item The BSP determines the scheduling plan selected by $\tau$0 and updates
		the major frame index value accordingly.
	\item The BSP continues to the barrier thereby opening it.
	\item All CPU cores including the BSP start executing after the barrier and
		use the updated major frame index value.
\end{enumerate}

This process guarantees that all CPU cores have a consistent view of the current
major frame index, simultaneous begin of a major frame cycle and that scheduling
plan changes only occur on major frame boundaries.
